Robust Mirror Descent Algorithm for a Multi-Armed Bandit Governed by a Stationary Finite Markov Chain
Authors
Abstract
Within the framework of "value stream oriented process management", value stream mapping and the short-cyclic improvement routine are integrated into the organizational framework of process management in order to enable methodically supported improvement of value streams at different levels of detail. This makes an advanced and sustainable continuous improvement process possible.
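The titled method builds on mirror descent for bandit problems. As a point of reference, here is a minimal sketch of the classical mirror descent update with a negative-entropy regularizer (the EXP3-style exponentiated-gradient form), not the authors' robust variant; all function and parameter names are illustrative.

```python
import numpy as np

def mirror_descent_bandit(reward_fn, n_arms, horizon, eta=0.1, seed=0):
    """EXP3-style mirror descent with a negative-entropy regularizer.

    reward_fn(arm, t) -> reward in [0, 1] for the pulled arm at round t.
    A generic sketch only, not the robust variant from the titled paper.
    """
    weights = np.ones(n_arms) / n_arms          # uniform initial play distribution
    rng = np.random.default_rng(seed)
    total = 0.0
    for t in range(horizon):
        arm = rng.choice(n_arms, p=weights)     # sample an arm from the current distribution
        r = reward_fn(arm, t)
        total += r
        # Importance-weighted loss estimate: only the pulled arm is observed.
        loss_hat = np.zeros(n_arms)
        loss_hat[arm] = (1.0 - r) / weights[arm]
        # Mirror descent step with negative entropy = multiplicative weights.
        weights *= np.exp(-eta * loss_hat)
        weights /= weights.sum()
    return total
```

With the negative-entropy regularizer, the mirror descent step reduces to the multiplicative-weights update above; other regularizers yield other bandit algorithms.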
Similar resources
A penalized bandit algorithm
We study a two-armed bandit algorithm with penalty. We show the convergence of the algorithm and establish the rate of convergence. For some choices of the parameters, we obtain a central limit theorem in which the limit distribution is characterized as the unique stationary distribution of a discontinuous Markov process.
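The excerpt does not state the update rule, so the following is only one plausible form of a penalized two-armed bandit scheme: a success moves the play probability toward the arm just played, while a failure applies a smaller penalty step away from it. All parameter names and values are assumed for illustration.

```python
import numpy as np

def penalized_two_armed_bandit(p_success, horizon, gamma=0.01, rho=0.1, seed=0):
    """One plausible reward/penalty update for a two-armed bandit (illustrative).

    p_success: (pA, pB) Bernoulli success probabilities, unknown to the player.
    x tracks the probability of playing arm A; a success pulls x toward the
    played arm by step gamma, a failure pushes it away by the smaller
    penalty step rho * gamma.
    """
    rng = np.random.default_rng(seed)
    x = 0.5
    for _ in range(horizon):
        play_a = rng.random() < x
        success = rng.random() < (p_success[0] if play_a else p_success[1])
        if play_a:
            x += gamma * (1 - x) if success else -rho * gamma * x
        else:
            x += -gamma * x if success else rho * gamma * (1 - x)
        x = min(max(x, 0.0), 1.0)  # keep x a valid probability
    return x
```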
Full text
Finite dimensional algorithms for the hidden Markov model multi-armed bandit problem
The multi-armed bandit problem is widely used in the scheduling of traffic in broadband networks, manufacturing systems, and robotics. This paper presents a finite-dimensional optimal solution to the multi-armed bandit problem for Hidden Markov Models. The key to solving any multi-armed bandit problem is to compute the Gittins index. In this paper a finite-dimensional algorithm is presented which exactly ...
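For reference, the Gittins index of an arm in state x can be written in its standard form as a ratio of expected discounted reward to expected discounted time over stopping times (a textbook definition, stated here for a fully observed finite-state arm rather than the hidden Markov setting of the paper):

```latex
\nu(x) \;=\; \sup_{\tau \ge 1}
\frac{\mathbb{E}\left[\sum_{t=0}^{\tau-1} \beta^{t}\, r(X_t) \,\middle|\, X_0 = x\right]}
     {\mathbb{E}\left[\sum_{t=0}^{\tau-1} \beta^{t} \,\middle|\, X_0 = x\right]},
\qquad \beta \in (0,1).
```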
Full text
A Value Iteration Algorithm for Partially Observed Markov Decision Process Multi-armed Bandits
A value-iteration-based algorithm is given for computing the Gittins index of a Partially Observed Markov Decision Process (POMDP) multi-armed bandit problem. This problem concerns the dynamic allocation of effort among a number of competing projects, of which only one can be worked on in any time period. The active project evolves according to a finite-state Markov chain and then generates a r...
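The abstract does not spell out the iteration, but for a fully observed finite-state arm the Gittins index can be computed by value iteration on the Katehakis-Veinott "restart-in-state" MDP. The sketch below illustrates that standard construction (not the POMDP algorithm of the paper); names and tolerances are illustrative.

```python
import numpy as np

def gittins_index_restart(P, r, beta=0.9, state=0, tol=1e-10, max_iter=100_000):
    """Gittins index of `state` for a finite-state Markov chain arm,
    via value iteration on the Katehakis-Veinott restart-in-state MDP.

    P: (n, n) transition matrix; r: (n,) reward vector; beta: discount factor.
    In every state one may either continue from there or restart from `state`;
    the index is (1 - beta) times the optimal value at `state`.
    """
    n = len(r)
    V = np.zeros(n)
    for _ in range(max_iter):
        continue_val = r + beta * P @ V                 # keep playing from each state
        restart_val = r[state] + beta * P[state] @ V    # teleport back to `state`
        V_new = np.maximum(continue_val, restart_val)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    return (1.0 - beta) * V[state]
```

Since the Bellman operator here is a beta-contraction, the iteration converges geometrically; a discount close to 1 simply needs more iterations for a given tolerance.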
Full text
Reinforcement learning and evolutionary algorithms for non-stationary multi-armed bandit problems
Multi-armed bandit tasks have been extensively used to model the problem of balancing exploitation and exploration. One of the most challenging variants of the multi-armed bandit problem (MABP) is the non-stationary bandit problem, where the agent faces the added complexity of detecting changes in its environment. In this paper we examine a non-stationary, discrete-time, finite-horizon bandit problem with a finite number ...
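The excerpt does not name the algorithms compared; a standard baseline for non-stationary bandits is epsilon-greedy with a constant step size, so that value estimates track a drifting environment instead of averaging over the whole history. A minimal sketch, with assumed parameter values:

```python
import numpy as np

def discounted_eps_greedy(reward_fn, n_arms, horizon, alpha=0.1, eps=0.1, seed=0):
    """Epsilon-greedy with a constant step size, a common non-stationary baseline.

    A fixed alpha gives exponential recency weighting, so estimates follow
    recent rewards. reward_fn(arm, t) -> observed reward; parameters are
    illustrative, not taken from the cited paper.
    """
    rng = np.random.default_rng(seed)
    q = np.zeros(n_arms)                  # tracked value estimates
    total = 0.0
    for t in range(horizon):
        if rng.random() < eps:
            arm = int(rng.integers(n_arms))       # explore uniformly
        else:
            arm = int(np.argmax(q))               # exploit current estimates
        reward = reward_fn(arm, t)
        total += reward
        q[arm] += alpha * (reward - q[arm])       # constant-step update
    return total, q
```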
Full text
On Robust Arm-Acquiring Bandit Problems
In the classical multi-armed bandit problem, at each stage the player must choose one of N given projects (arms) to play, generating a reward that depends on the arm played and its current state. The state process of each arm is modeled by a Markov chain whose transition probabilities are known a priori. The goal of the player is to maximize the expected total reward. One variant of the problem, the so...
Full text